Abstract: This paper introduces a framework for regression
with dimensionally distributed data and a fusion center. A
cooperative learning algorithm, the iterative conditional expectation
algorithm (ICEA), is designed within this framework.
The algorithm can effectively discover linear combinations of
individual estimators trained by each agent without transferring
and storing large amounts of data amongst the agents and the
fusion center. The convergence of ICEA is explored. Specifically,
for a two-agent system, each complete round of ICEA is
guaranteed to be a non-expansive map on the function space
of each agent. The advantages and limitations of ICEA are also
discussed for data sets with various distributions and various
hidden rules. Moreover, several techniques are designed to
leverage the algorithm to effectively learn more complex hidden
rules that are not linearly decomposable.
Keywords: Distributed learning, heterogeneous data, regression,
estimation.

I. Introduction 

Distributed learning is a field that generalizes classical 
machine learning algorithms. Instead of having full access 
to all the data and being capable of central computation, in 
the framework of distributed learning, there are a number of 
agents that have access to only part of the data. And the agents 
(perhaps with a fusion center) are capable of exchanging 
certain types of information among one another. Usually, due
to privacy concerns, limited bandwidth and limited power, 
the content and amount of information shared are restricted. 
Research in distributed learning seeks effective learning algo- 
rithms and theoretical limits within such constraints. 

In terms of data structures, two types of distributed learning 
problems are: homogeneous data and heterogeneous data (or 
horizontally distributed / instance distributed data and verti- 
cally distributed / dimensionally distributed data). In terms 
of the organization of distributed learning systems, there are 
also basically two types: systems with a fusion center and 
systems without a fusion center. In [1], [2] two important types
of models, instance distributed learning with and without a 
fusion center, are discussed and several practical algorithms 
are provided. The relationship between the information transmitted
amongst individual agents and the fusion center and the
ensemble learning ability is also discussed. There has also
been some research on distributed learning with dimensionally 
distributed data, such as in [3]. However, the approach in 
[3] is not cooperative - individual agents first optimize their 
own estimator, and, given these estimators, the fusion center 
then constructs an optimal linear combination of them. In 
this paper, however, we will concentrate on a cooperative 
training algorithm, in which the fusion center coordinates the 
individual agents to optimize the ensemble estimator. 

It is also worth pointing out the connection between di- 
mensionally distributed learning and boosting for regression, 
which was first introduced in [5] and developed in many other 
works such as [6]. The algorithm developed in this paper can 
be viewed as an L2-regression boosting algorithm with extra 
constraints on the space from which the weak hypothesis can 
be selected. This perspective can bring insights of boosting to 
the problem of distributed regression. 

II. Description of the Problem 

In this paper, we discuss the problem of estimation (or 
regression) with dimensionally distributed data and a fusion 
center. The problem is specified as follows. 

There are $M$ independent variables (or features)
$X_1, \dots, X_M$ and one dependent variable $Y$. The complete
data set is composed of

$$\{(x_{i1}, x_{i2}, \dots, x_{iM}, y_i)\}_{i=1}^{n},$$

where $n$ is the number of instances, $x_{ij} \in \mathbb{R}$ is the $i$-th instance
of $X_j$, and $y_i \in \mathbb{R}$ is the $i$-th instance of $Y$.

We also assume that there exists a hidden deterministic
function (or rule, or hypothesis)

$$\phi : \mathbb{R}^M \to \mathbb{R}$$

such that

$$y_i = \phi(x_{i1}, x_{i2}, \dots, x_{iM}) + w_i,$$

where $\{w_i\}_{i=1}^{n}$ is an independently drawn sample from a zero-mean
random variable $W$ that is independent of $X_1, \dots, X_M$
and $Y$.

Suppose there are $D$ agents, each of which has only limited
access to certain features. Define $F_j$ ($j = 1, \dots, D$) to be the
set of features accessible by agent $j$, and define $F = \bigcup_{j=1}^{D} F_j$
so that $|F| = M$.
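As a minimal sketch of this data layout (an illustration only, not part of the original formulation), the Python fragment below draws a data set and slices it into per-agent views. The helper names `make_dataset` and `agent_view`, the uniform feature distribution, and the example rule are all assumptions introduced here.

```python
import numpy as np

def make_dataset(phi, n, m, noise_std=0.0, seed=0):
    """Draw n instances of m features and responses y_i = phi(x_i) + w_i."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n, m))   # features X_1..X_M (uniform is an assumption)
    w = rng.normal(0.0, noise_std, size=n)   # zero-mean noise drawn independently of x
    return x, phi(x) + w

def agent_view(x, feature_set):
    """Columns of x visible to one agent; feature_set holds 0-based indices (F_j)."""
    return x[:, sorted(feature_set)]

# Hypothetical example: M = 3 features, D = 2 agents with overlapping access.
rule = lambda x: x[:, 0] + x[:, 1] * x[:, 2]
x, y = make_dataset(rule, n=100, m=3)
feature_sets = [{0, 1}, {1, 2}]              # F_1 and F_2
views = [agent_view(x, f) for f in feature_sets]
covered = set().union(*feature_sets)         # F, the union of all F_j
```

Only the fusion center would hold `y`; each agent holds its own entry of `views`.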

In order to concentrate on the "distributed part" of the
problem, we assume that each agent is capable, given enough
data, of learning the optimal minimum-mean-square-error
(MMSE) estimator based on limited access to the features.
More specifically, we assume that agent $j$ can solve the
optimization problem (given enough data)

$$\min_{g_j(\{X_t\}_{t \in F_j})} E\left[\left(\zeta(\{X_t\}_{t \in F}) + W - g_j(\{X_t\}_{t \in F_j})\right)^2\right],$$

where $\zeta$ can be any $M$-dimensional function satisfying some
regularity conditions. Due to the independence and unbiasedness
of the noise, the above optimization problem can be
simplified to

$$\min_{g_j(\{X_t\}_{t \in F_j})} E\left[\left(\zeta(\{X_t\}_{t \in F}) - g_j(\{X_t\}_{t \in F_j})\right)^2\right].$$

The solution to the optimization problem above is

$$g_j(x_{F_j}) = E\left[\zeta(X_F) \mid X_{F_j} = x_{F_j}\right],$$

where, for simplicity, we use $x_F$ to represent $\{x_t\}_{t \in F}$; that
is, we assume that each agent is capable of estimating the
conditional expectation of a function on $F$ given the several
dimensions comprising $F_j$, and based on enough data.

Under this model, one way to deal with the distributed estimation
problem is another optimization problem formulated
as follows:

$$\min_{\rho} E\left[\left(\phi(X_F) - \rho\left(g_1(X_{F_1}), \dots, g_D(X_{F_D})\right)\right)^2\right],$$

where the functions $g_i$, $i = 1, \dots, D$, are fixed and given by
the agents. The optimization problem above is intractable in
its full generality, and this non-cooperative training approach
does not take full advantage of the communication between
individual agents and the fusion center (because it uses only
one-way agent-to-center communication).

However, if we restrict the function $\rho$ to be of the additive
form

$$\rho(g_1, g_2, \dots, g_D) = g_1 + g_2 + \cdots + g_D,$$

and optimize over $g_j$, $j = 1, \dots, D$, i.e. we change this
problem into a simplified version

$$\min_{g_1, g_2, \dots, g_D} E\left[\left(\phi(X_F) - \left(g_1(X_{F_1}) + \cdots + g_D(X_{F_D})\right)\right)^2\right],$$

we then change a two-step optimization problem (first individual
agents optimize their own estimators, then the fusion center
optimizes the ensemble) to a one-step cooperative optimization
problem (the agents, with the coordination of the fusion center,
optimize the sum of their estimators cooperatively). So we can
seek an algorithm through which the agents can cooperatively
solve the above problem.

III. Communication and Memory Restrictions

We assume that each agent can store all the data instances
of its accessible features, i.e. agent $j$ has access to data
$\{x_{it}\}_{i=1}^{n}, \forall t \in F_j$. We also assume that the fusion center
can store $\{y_i\}_{i=1}^{n}$, which is equivalent to a one-dimensional
data set. The agents and fusion center also have an additional
one-dimensional memory (which can store all the instances
of the dependent variable or one dimension of the features)
for computation only, and there is no additional space beyond
their own allocation.

We further assume that the fusion center has two-way
communication with all the agents. To be more specific, each
agent can read and write on the one-dimensional data stored
in the fusion center.

Moreover, as noted above, we also require each agent to
be capable of finding the ideal MMSE estimator (within a
certain function space $\mathcal{T}$) based on its accessible data. Fig. 1
is an illustration of the structure of a typical dimensionally
distributed learning system.

Fig. 1. Structure of a dimensionally distributed learning system for regression.

IV. Iterative Conditional Expectation Algorithm

A. Basic Idea



Motivated by the backfitting algorithm for additive models
in [4], we propose the iterative conditional expectation algorithm
(ICEA). The basic idea of this algorithm is very simple:
First, agent 1 asks for the value of $\phi(x_F)$ for all the data
instances from the fusion center, makes an estimate based on
features in $F_1$ and thereby obtains $g_1(x_{F_1})$. Of course, $g_1(x_{F_1})$
cannot fully represent the true function $\phi(x_F)$ because it lies
in a much smaller function space.

Then, agent 1 sends back its estimate for all the data
instances to the fusion center, and the fusion center stores
the residual $\phi(x_F) - g_1(x_{F_1})$ for all the data instances. Then,
agent 2 asks for the value of $\phi(x_F) - g_1(x_{F_1})$ from the fusion
center, makes an estimate based on features in $F_2$ and thereby
obtains $g_2(x_{F_2})$. This time, $g_1(x_{F_1}) + g_2(x_{F_2})$ is a better
approximation of the true hidden rule $\phi(x_F)$.

This process is continued for all agents. When the process
eventually returns to agent 1, it then asks the fusion center
for the value of $\phi(x_F) - \sum_{j=1}^{D} g_j(x_{F_j})$, thereby obtains
$\Delta g_1(x_{F_1})$, and stores $(g_1 + \Delta g_1)(x_{F_1})$ as the updated version
of $g_1(x_{F_1})$. Then agent 1 sends the value of $\Delta g_1(x_{F_1})$ for
all the instances to the fusion center, and the fusion center
obtains the updated version of $\phi(x_F) - \sum_{j=1}^{D} g_j(x_{F_j})$. This
continues to agent 2, and so on.

After a few rounds of iteration, the algorithm will converge
to a limit (we will show this below, under some further
conditions). The sum of the limit functions, i.e.

$$\sum_{j=1}^{D} g_j(x_{F_j}),$$

is the best linearly decomposed approximation of $\phi(x_F)$ in
terms of MMSE.



B. ICEA in Detail 

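The following is a more precise description of the above
algorithm (in terms of actual data instead of an evolution of
ideal functions):

    $g_j(x_{F_j}) \leftarrow 0,\ \forall j \in \{1, \dots, D\}$;
    $z_i \leftarrow y_i,\ \forall i \in \{1, \dots, n\}$;
    $err_{new} \leftarrow \frac{1}{n} \sum_{i=1}^{n} z_i^2$;
    $err_{old} \leftarrow 0$;
    while $|err_{old} - err_{new}| > \epsilon$ do
        $err_{old} \leftarrow err_{new}$;
        for $j$ from 1 to $D$ do
            $\Delta g_j(x_{F_j}) \leftarrow \mathrm{TRAIN}(\{z_i, \{x_{it}\}_{t \in F_j}\}_{i=1}^{n})$;
            $g_j(x_{F_j}) \leftarrow g_j(x_{F_j}) + \Delta g_j(x_{F_j})$;
            $z_i \leftarrow z_i - \Delta g_j(\{x_{it}\}_{t \in F_j}),\ \forall i$;
        end
        $err_{new} \leftarrow \frac{1}{n} \sum_{i=1}^{n} z_i^2$;
    end

    function $\mathrm{TRAIN}(\{y_i, \{x_{it}\}_{t \in F}\}_{i=1}^{n})$ return $g(x_F)$;
        $g(x_F) = \arg\min_{g \in \mathcal{T}} \sum_{i=1}^{n} \left(y_i - g(\{x_{it}\}_{t \in F})\right)^2$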

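Actually, the $\Delta g_j(x_{F_j}) \leftarrow \mathrm{TRAIN}(\{z_i, \{x_{it}\}_{t \in F_j}\}_{i=1}^{n})$ step,
given enough data, is essentially computing

$$E\left[\zeta(X_F) \mid X_{F_j} = x_{F_j}\right],$$

where the function $\zeta(x_F)$ satisfies $\zeta(\{x_{it}\}_{t \in F}) = z_i$. This
step is handled by individual agents, which we have assumed
can be done perfectly. Therefore, if we concentrate on the
functional evolution level (instead of on the actual data), the
algorithm can be interpreted in terms of iterative conditional
expectations:

    $g_j(x_{F_j}) \leftarrow 0,\ \forall j \in \{1, \dots, D\}$;
    $\zeta(x_F) \leftarrow \phi(x_F)$;
    $err_{new} \leftarrow E[\zeta^2(X_F)]$;
    $err_{old} \leftarrow 0$;
    while $|err_{old} - err_{new}| > \epsilon$ do
        $err_{old} \leftarrow err_{new}$;
        for $j$ from 1 to $D$ do
            $\Delta g_j(x_{F_j}) \leftarrow E[\zeta(X_F) \mid X_{F_j} = x_{F_j}]$;
            $g_j(x_{F_j}) \leftarrow g_j(x_{F_j}) + \Delta g_j(x_{F_j})$;
            $\zeta(x_F) \leftarrow \zeta(x_F) - \Delta g_j(x_{F_j})$;
        end
        $err_{new} \leftarrow E[\zeta^2(X_F)]$;
    end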
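The data-level loop can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `train_poly`, `icea` and `predict` are hypothetical, and `train_poly` is an assumed stand-in for TRAIN that fits an additive polynomial by least squares, whereas the paper leaves the agents' learner open.

```python
import numpy as np

def train_poly(z, x, degree=3):
    """Assumed stand-in for TRAIN: least-squares fit of z on powers of each column of x."""
    def design(a):
        cols = [np.ones(len(a))]
        for d in range(1, degree + 1):
            cols.append(a ** d)              # (n, k) block: d-th power of every feature
        return np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design(x), z, rcond=None)
    return lambda a: design(a) @ coef

def icea(x, y, feature_sets, eps=1e-10, max_rounds=100):
    """Run ICEA; returns each agent's list of increments and the residual MSE per round."""
    z = y.astype(float).copy()               # residual stored at the fusion center
    models = [[] for _ in feature_sets]      # g_j is the sum of agent j's increments
    errors = [np.mean(z ** 2)]
    for _ in range(max_rounds):
        for j, f in enumerate(feature_sets):
            delta = train_poly(z, x[:, f])   # agent j fits the current residual
            models[j].append(delta)
            z = z - delta(x[:, f])           # fusion center updates the residual
        errors.append(np.mean(z ** 2))
        if abs(errors[-2] - errors[-1]) <= eps:
            break
    return models, errors

def predict(models, feature_sets, x):
    """Global estimate: sum of every agent's accumulated estimator."""
    return sum(d(x[:, f]) for ms, f in zip(models, feature_sets) for d in ms)
```

Each entry of `feature_sets` is a list of column indices, for example `[[0], [1]]` for two one-dimensional agents. Only the residual vector `z` and the increments evaluated on the data cross the agent boundary.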



Notice that the training errors are system "biases" caused by 
the limitation of our "linear decomposition", and the effects of 
random error caused by the finite number of training examples. 
These latter are the same as in classical learning theory and are 
not factors to be considered as resulting from the "distributed" 
nature of the problem. Thus, we do not consider them in our 
discussion. 

From the point of view of the fusion center, it simply sends
its data that represents $\zeta(x_F)$ to an agent, waits for the agent
to first update its own estimator $g_j(x_{F_j})$, and then to send
back the difference $\Delta g_j(x_{F_j})$. The fusion center then updates
its data to represent $\zeta(x_F) - \Delta g_j(x_{F_j})$, and moves on to the
next agent.

From the point of view of an individual agent, the task is
also straightforward: when agent $j$ receives the data describing
the latest version of $\zeta(x_F)$ from the fusion center, it finds an
optimal estimator $\Delta g_j(x_{F_j})$ of $\zeta(x_F)$ based on all the data
on features in $F_j$, uses $\Delta g_j(x_{F_j})$ to update its own estimator
$g_j(x_{F_j})$, and then sends $\Delta g_j(x_{F_j})$ back to the fusion center.
In general, the algorithm is simple, and each agent can use
its own learning algorithm to determine (approximately) the
conditional-mean estimator.

It is worth noting that once the estimator is trained, it is 
distributively allocated throughout the entire system. Thus, 
when new data comes to the fusion center, it sends features 
to the corresponding agents and then sums their estimates to 
form a global estimate. 

V. Theoretical analysis of ICEA 

Intuitively, the above algorithm will repeatedly reduce the 
power of the residual stored in the fusion center. But does it 
converge? And, if so, what does it converge to and at what 
rate? Now let us look at the answers to these questions for 
some special cases. 

The monotonicity of the residual is easy to see. More specif- 
ically, the root-mean-square-error (RMSE) of the ensemble 
estimator is monotonically non-increasing. This is because in 
ICEA, we repeatedly fix all the individual estimators but one, 
and optimize only that one and use the new function to replace 
the old. Thus, the new estimator cannot be worse than the old 
one, and therefore, the RMSE must be non-increasing. 



Moreover, since the RMSE is always non-negative, the 
RMSE sequence is a monotonically non-increasing, lower 
bounded sequence, which guarantees the convergence of the 
algorithm (if we use the change in RMSE as the convergence 
criterion, which is what we did in the algorithm previously 
shown). However, there is no guarantee of uniqueness (dif- 
ferent initial conditions might lead to different limits), nor 
of equivalence between the limits and the solution to the 
optimization problem given in the previous section. 

So in the following subsections, we discuss the functional 
convergence of ICEA under some special cases. 

A. Non-expansive map for two agent case 

For the two-dimensional, two-agent case, the algorithm is
intended to solve the following optimization problem:

$$\min_{g_1, g_2} E\left[\left(\phi(X_1, X_2) - g_1(X_1) - g_2(X_2)\right)^2\right].$$

It is straightforward to show that the optimal solution
$g_{1,opt}(x_1)$ and $g_{2,opt}(x_2)$ should satisfy the equations

$$g_1(x_1) = E\left[\left(\phi(x_1, X_2) - g_2(X_2)\right) \mid X_1 = x_1\right]$$

and

$$g_2(x_2) = E\left[\left(\phi(X_1, x_2) - g_1(X_1)\right) \mid X_2 = x_2\right]$$

simultaneously.

On the other hand, if we apply ICEA to the two-dimensional
distributed learning problem, we will iteratively find the solutions
to the equations above. And (hopefully) the solution
will converge to the desired $g_{1,opt}(x_1)$ and $g_{2,opt}(x_2)$; i.e.
ICEA enables us to approximate the solution to a difficult
optimization problem by solving a sequence of simplified
optimization problems iteratively. Of course, rigorously, we
need to prove the convergence of this algorithm and the
uniqueness of its limit.

Ideally, if we can show that each round of the algorithm
is actually a contractive map on a well-defined metric space,
it is easy to apply the fixed point theorem to guarantee the
uniqueness of the limit.

Unfortunately, we can prove only a weaker conclusion: for
the two-agent case, ICEA, after each complete round (i.e.
after each agent updates its estimator), is equivalent to a
non-expansive map.

First we need to define a suitable measure of distance
between two functions $g(x_F)$ and $h(x_F)$:

$$d\left(g(x_F), h(x_F)\right) = E\left[\left(g(X_F) - h(X_F)\right)^2\right].$$

The algorithm performs the following operation on a function
$g_1(x_1)$ after each complete round (denote the mapping as $T$):

$$T\{g_1(x_1)\} = E\left[\phi(X_1, X_2) - E\left[\phi(X_1, X_2) - g_1(X_1) \mid X_2\right] \mid X_1 = x_1\right].$$

Therefore, the distance between $T\{g_1(x_1)\}$ and $T\{g_1^*(x_1)\}$ is
given by

$$E\left[\left(E\left[E\left[g_1(X_1) - g_1^*(X_1) \mid X_2\right] \mid X_1\right]\right)^2\right].$$

In order to show that $T$ is a non-expansive map, it is equivalent
to prove the following inequality:

$$E\left[\left(E\left[E[g(X_1) \mid X_2] \mid X_1\right]\right)^2\right] \le E\left[g^2(X_1)\right],$$

where $g(X_1) = g_1(X_1) - g_1^*(X_1)$.

Define $\mu_g = E[g(X_1)]$, and notice two facts:

$$E\left[\left(E\left[E[g(X_1) \mid X_2] \mid X_1\right]\right)^2\right] - \mu_g^2 = E\left[\left(E\left[E[g(X_1) \mid X_2] \mid X_1\right] - \mu_g\right)^2\right]$$

and

$$E\left[g^2(X_1)\right] - \mu_g^2 = E\left[\left(g(X_1) - \mu_g\right)^2\right].$$

Then, the original inequality is equivalent to the inequality

$$E\left[\left(E\left[E[g(X_1) \mid X_2] \mid X_1\right] - \mu_g\right)^2\right] \le E\left[\left(g(X_1) - \mu_g\right)^2\right].$$

Then, we have that the left hand side satisfies

$$\begin{aligned}
\mathrm{LHS} &= E\left[\left(E\left[E[g(X_1) \mid X_2] \mid X_1\right] - E\left[E[g(X_1)] \mid X_1\right]\right)^2\right] \\
&= E\left[\left(E\left[E[g(X_1) \mid X_2] - E[g(X_1)] \mid X_1\right]\right)^2\right] \\
&\le E\left[E\left[\left(E[g(X_1) \mid X_2] - E[g(X_1)]\right)^2 \mid X_1\right]\right] \\
&= E\left[\left(E[g(X_1) \mid X_2] - E[g(X_1)]\right)^2\right].
\end{aligned}$$

The inequality step is because of Jensen's inequality:

$$\varphi(E[X]) \le E[\varphi(X)]$$

when $\varphi$ is a (measurable) convex function. Moreover, we also
have

$$E\left[\left(E[g(X_1) \mid X_2] - E[g(X_1)]\right)^2\right] = \mathrm{Var}\left[E[g(X_1) \mid X_2]\right],$$

and hence,

$$\mathrm{LHS} \le \mathrm{Var}\left[E[g(X_1) \mid X_2]\right].$$

In addition, we have that the right hand side satisfies

$$\mathrm{RHS} = \mathrm{Var}[g(X_1)],$$

and an important relationship (the law of total variance):

$$\mathrm{Var}[g(X_1)] = \mathrm{Var}\left[E[g(X_1) \mid X_2]\right] + E\left[\mathrm{Var}[g(X_1) \mid X_2]\right].$$

Therefore,

$$\mathrm{RHS} \ge \mathrm{LHS} + E\left[\mathrm{Var}[g(X_1) \mid X_2]\right] \ge \mathrm{LHS},$$

and hence we have proven that $T$ is a non-expansive map.
This result can be easily generalized to the two-agent, high
dimensional case.
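The chain of inequalities can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint probability table `p` and the function values `g` are arbitrary, hypothetical choices, and conditional expectations are computed exactly as matrix products.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((4, 5))
p /= p.sum()                             # hypothetical joint pmf of (X1, X2)
p1, p2 = p.sum(axis=1), p.sum(axis=0)    # marginals of X1 and X2
A = p / p2                               # A[i, k] = P(X1 = i | X2 = k)
B = p / p1[:, None]                      # B[i, k] = P(X2 = k | X1 = i)

g = rng.normal(size=4)                   # arbitrary g(X1) on the four support points
h = g @ A                                # h[k]  = E[g(X1) | X2 = k]
tg = B @ h                               # tg[i] = E[E[g(X1) | X2] | X1 = i]

lhs = np.sum(p1 * tg ** 2)               # E[(E[E[g|X2]|X1])^2]
mid = np.sum(p2 * h ** 2)                # E[(E[g|X2])^2]
rhs = np.sum(p1 * g ** 2)                # E[g^2]

mu = np.sum(p1 * g)
var_g = rhs - mu ** 2                    # Var[g(X1)]
var_h = mid - mu ** 2                    # Var[E[g(X1)|X2]]
e_cond_var = np.sum(p2 * ((g ** 2) @ A - h ** 2))   # E[Var[g(X1)|X2]]
```

Here `lhs <= mid <= rhs` mirrors the two steps of the proof (Jensen's inequality, then the law of total variance), and `var_g` equals `var_h + e_cond_var` up to rounding.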



B. Contractive map for a special case 

Non-expansiveness is weaker than contractiveness, and there
is no general "fixed-point" theorem for non-expansive maps. But, under certain conditions,
we are able to draw stronger conclusions. For instance,
if we restrict the problem to the two-dimensional case, restrict
the hidden rule to be a finite order bivariate polynomial,
and restrict the distribution of the features $(X_1, X_2)$ to the
two-dimensional joint Gaussian distribution with correlation
coefficient $|\rho| < 1$, then ICEA can be shown to be a contractive
map.

Moreover, for the two-agent, two-dimensional Gaussian case
above, we can also measure the speed of convergence by the
contractive factor of the contractive map. It can be shown that
the factor is $\rho^4$, i.e.

$$d\left(T(g(X_1)), T(h(X_1))\right) \le \rho^4\, d\left(g(X_1), h(X_1)\right).$$

So when the two dimensions are weakly correlated, the
convergence can be very fast.
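A numerical check of the contractive factor is sketched below for zero-mean polynomials, under the stated jointly Gaussian assumption with zero mean and unit variance. The helper names `cond_exp`, `T` and `sq_norm` are hypothetical; conditional expectations are evaluated by Gauss-Hermite quadrature, which is exact for the low-degree polynomials used here.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

rho = 0.5
z, w = hermegauss(40)              # nodes and weights for the weight exp(-z^2 / 2)
w = w / np.sqrt(2.0 * np.pi)       # normalise so the weights sum to 1 (standard normal)

def cond_exp(f, given):
    """E[f(U) | V = given] for a standard bivariate normal (U, V) with correlation rho."""
    pts = rho * np.asarray(given)[..., None] + np.sqrt(1.0 - rho ** 2) * z
    return f(pts) @ w              # quadrature over the last axis

def T(g):
    """Difference map of one ICEA round: g -> E[E[g(X1) | X2] | X1]."""
    return lambda x: cond_exp(lambda y: cond_exp(g, y), x)

def sq_norm(f):
    """E[f(X)^2] for X ~ N(0, 1)."""
    return float(f(z) ** 2 @ w)

g_lin = lambda x: x                          # first Hermite polynomial
g_quad = lambda x: x ** 2 - 1                # second Hermite polynomial
g_mix = lambda x: x ** 3 - 2 * x             # another zero-mean polynomial

ratio_lin = sq_norm(T(g_lin)) / sq_norm(g_lin)
ratio_quad = sq_norm(T(g_quad)) / sq_norm(g_quad)
ratio_mix = sq_norm(T(g_mix)) / sq_norm(g_mix)
```

The linear function attains the bound (`ratio_lin` equals `rho ** 4`), while higher-order components shrink faster, so every zero-mean polynomial contracts by at least the factor $\rho^4$ in the squared distance.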

VI. Simulation results of ICEA 

A. Simulation in Terms of Function Evolution 

A detailed simulation of the two-agent, two-dimensional,
finite-order bivariate-polynomial hidden rule, jointly-Gaussian
case is shown below. The hidden rule is

$$\phi(x_1, x_2) = x_1 x_2^2 + x_1^2 + 2,$$

and $(X_1, X_2)$ is jointly Gaussian with zero mean, unit variance
and correlation coefficient $\rho = 1/2$.

On initializing $g_1(x_1)$ and $g_2(x_2)$ to be 0 and applying
ICEA to the problem, we get the results shown in
Table I.

Round   $g_1(x_1)$                                  $g_2(x_2)$                      RMSE
1       $2 + .7500 x_1 + x_1^2 + .2500 x_1^3$       $-.6563 x_2 + .4688 x_2^3$      1.4941406250
2       $2 + .5508 x_1 + x_1^2 + .1914 x_1^3$       $-.4907 x_2 + .4761 x_2^3$      1.2974381447
3       $2 + .4598 x_1 + x_1^2 + .1905 x_1^3$       $-.4442 x_2 + .4762 x_2^3$      1.2864467088
4       $2 + .4364 x_1 + x_1^2 + .1905 x_1^3$       $-.4325 x_2 + .4762 x_2^3$      1.2857600621
5       $2 + .4305 x_1 + x_1^2 + .1905 x_1^3$       $-.4295 x_2 + .4762 x_2^3$      1.2857171467
6       $2 + .4291 x_1 + x_1^2 + .1905 x_1^3$       $-.4288 x_2 + .4762 x_2^3$      1.2857144645
Limit   $2 + 3/7\, x_1 + x_1^2 + 4/21\, x_1^3$      $-3/7\, x_2 + 10/21\, x_2^3$    9/7

TABLE I
Step-by-step results of the ICEA.

The evolution of the functions of Table I is shown in Fig. 2
and Fig. 3. It is quite clear that there is no visible difference
after a few rounds of iterations.
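The coefficients in Table I can be reproduced in coefficient space with a short sketch. Everything below is an assumption made for illustration: the hidden rule $x_1 x_2^2 + x_1^2 + 2$ is used because it is consistent with the tabulated values, the names `cond_moment`, `project` and `icea_poly` are hypothetical, and the conditional moments are those of $N(\rho v, 1 - \rho^2)$ for a standard bivariate Gaussian.

```python
import numpy as np
from math import comb

RHO = 0.5
DEG = 3   # every polynomial in this example has degree at most 3

def even_moment(k):
    """E[Z^k] = (k - 1)!! for even k and Z ~ N(0, 1)."""
    out = 1
    for i in range(k - 1, 0, -2):
        out *= i
    return out

def cond_moment(b):
    """Ascending coefficients in v of E[U^b | V = v], where U | V = v ~ N(RHO v, 1 - RHO^2)."""
    c = np.zeros(DEG + 1)
    for k in range(0, b + 1, 2):
        c[b - k] += comb(b, k) * (1 - RHO ** 2) ** (k // 2) * even_moment(k) * RHO ** (b - k)
    return c

def project(C, axis):
    """E[phi(X1, X2) | X_axis] as ascending coefficients; C[a, b] multiplies x1^a x2^b."""
    out = np.zeros(DEG + 1)
    for a in range(DEG + 1):
        for b in range(DEG + 1):
            if C[a, b] == 0.0:
                continue
            keep, avg = (a, b) if axis == 1 else (b, a)
            for d, coef in enumerate(cond_moment(avg)):
                if coef != 0.0:
                    out[keep + d] += C[a, b] * coef
    return out

def cond_exp_uni(q):
    """E[q(U) | V = v] for a univariate polynomial q given by ascending coefficients."""
    return sum(q[b] * cond_moment(b) for b in range(DEG + 1))

def icea_poly(rounds):
    C = np.zeros((DEG + 1, DEG + 1))
    C[1, 2], C[2, 0], C[0, 0] = 1.0, 1.0, 2.0   # assumed rule: x1*x2^2 + x1^2 + 2
    g1, g2 = np.zeros(DEG + 1), np.zeros(DEG + 1)
    for _ in range(rounds):
        g1 = project(C, 1) - cond_exp_uni(g2)   # agent 1 refits phi - g2
        g2 = project(C, 2) - cond_exp_uni(g1)   # agent 2 refits phi - g1
    return g1, g2
```

Calling `icea_poly(1)` yields the first row of the table, and a larger number of rounds approaches the limit row; each agent's update is written in the equivalent "refit the full residual" form rather than as an increment.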




Fig. 2. The convergence of $g_1(x_1)$.

Fig. 3. The convergence of $g_2(x_2)$.

Moreover, the speed of convergence can be measured (approximately)
by the surplus RMSE (the difference between the
RMSE of the ensemble estimator after the $n$-th iteration and
the limit RMSE), as shown in Fig. 4. Also notice that in the
semi-logarithmic plot, the slope $k$ of the line is $-2.79375$, and
$(e^{k})^{1/4} \approx 0.5 = \rho$, which is compatible with our theory in
the previous section.

The limit functions are the unique solution to the equations

$$g_1(x_1) = \int_{-\infty}^{\infty} \left(\phi(x_1, x_2) - g_2(x_2)\right) f_{X_2|X_1}(x_2 \mid x_1)\, dx_2,$$

$$g_2(x_2) = \int_{-\infty}^{\infty} \left(\phi(x_1, x_2) - g_1(x_1)\right) f_{X_1|X_2}(x_1 \mid x_2)\, dx_1.$$

This is shown in the appendix.

Fig. 4. The rate of RMSE convergence for the two-agent, two-dimensional,
joint Gaussian case (surplus RMSE versus number of rounds).



B. Simulation on Real Data 

Our discussion above always assumes enough training data 
and perfect individual agents that can find the MMSE estima- 
tor. However, to justify the efficacy of ICEA in solving real 



problems, we test the algorithm on sampled data sets, in contrast to the
functional "simulation" of the previous section.

In order to compare the distributed regression to other
multi-dimensional regression algorithms, we use three functions used
in [7] (originally from [8] and [9]) as the hidden rule to
generate our simulation training data sets. The three functions
are

• Friedman-1:

$$\phi(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 1/2)^2 + 10 x_4 + 5 x_5 + w,$$

where $X_j \sim U[0, 1]$, $j = 1, \dots, 5$;

• Friedman-2:

$$\phi(x) = \left(x_1^2 + \left(x_2 x_3 - \frac{1}{x_2 x_4}\right)^2\right)^{1/2} + w,$$

where

$X_1 \sim U[1, 100]$,
$X_2 \sim U[40\pi, 560\pi]$,
$X_3, X_5 \sim U[0, 1]$,
$X_4 \sim U[1, 11]$;

• Friedman-3:

$$\phi(x) = \tan^{-1}\left(\frac{x_2 x_3 - \frac{1}{x_2 x_4}}{x_1}\right) + w,$$

where the distributions of the features are the same as those
of Friedman-2.

All the feature variables are independent, and before running
the algorithm, the outcomes are normalized to the range [0, 1].
Also, to highlight the effect of distributiveness of the system,
the independent white noise $w$ is set to zero in our simulation.
It is also worth pointing out that in Friedman-2 and Friedman-3,
feature $X_5$ is irrelevant, and is set up as a test of the
algorithm's resistance to irrelevant features.
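The three data sets as described above can be generated with a few lines of NumPy. This is a sketch under the stated settings (noise set to zero, outcomes rescaled to [0, 1]); the function name `friedman` and its arguments are hypothetical, and the min-max rescaling is one assumed reading of "normalized".

```python
import numpy as np

def friedman(kind, n, seed=0):
    """Sample n noiseless instances of Friedman-1, -2 or -3 with outcomes scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    if kind == 1:
        x = rng.uniform(0.0, 1.0, size=(n, 5))
        y = (10 * np.sin(np.pi * x[:, 0] * x[:, 1]) + 20 * (x[:, 2] - 0.5) ** 2
             + 10 * x[:, 3] + 5 * x[:, 4])
    else:
        x = np.column_stack([
            rng.uniform(1.0, 100.0, n),                  # X1
            rng.uniform(40 * np.pi, 560 * np.pi, n),     # X2
            rng.uniform(0.0, 1.0, n),                    # X3
            rng.uniform(1.0, 11.0, n),                   # X4
            rng.uniform(0.0, 1.0, n),                    # X5, irrelevant to the outcome
        ])
        inner = x[:, 1] * x[:, 2] - 1.0 / (x[:, 1] * x[:, 3])
        y = np.sqrt(x[:, 0] ** 2 + inner ** 2) if kind == 2 else np.arctan(inner / x[:, 0])
    y = (y - y.min()) / (y.max() - y.min())              # rescale outcomes to [0, 1]
    return x, y
```

For example, `friedman(2, 2000)` produces a training set of the size used in the experiments below.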

Moreover, we did the experiments on three different types 
of distributed regression systems: 

• System-1: five 1-dimensional agents

$\{X_1\}, \{X_2\}, \{X_3\}, \{X_4\}, \{X_5\}$

• System-2: three 2-dimensional agents

$\{X_1, X_2\}, \{X_2, X_3\}, \{X_4, X_5\}$

• System-3: two 3-dimensional agents

$\{X_1, X_2, X_3\}, \{X_2, X_4, X_5\}$

In all these cases, all the features are fully covered by the 
union of the features observable by individual agents. 

We use a regression tree - a commonly used "weak learner" 
in the boosting algorithms - as our learning algorithm for 
the individual agents. Notice that L2-regression boosting (in- 
troduced in [10]) is equivalent to a one-agent system, in 
which all the dimensions are accessible by the agent. In this 
sense, L2-regression boosting is the "limit algorithm" of the 



ICEA for distributed systems. So it is natural to compare the
performance of the distributed systems to L2-regression boosting.
Also, to compare the cooperative algorithm ICEA to a
non-cooperative algorithm (like the algorithms in [3]), we also ran
the data on a hierarchical algorithm. The individual agents are
identical to those of ICEA, and the fusion center further
trains an estimator using L2-regression boosting, taking the
outputs of the agents as the features.

Running three different types of algorithms on three different
data sets, with 2000 training data points and 4000 test
data points, we obtain the mean squared errors shown in Table II;
the detailed plots of training/test error after each round of
L2-regression boosting and ICEA with three different distributed
systems running on data set Friedman-2 are shown in Fig. 5.



Data Set      System   ICEA      Hierarchical   L2 Boosting (all features)
Friedman-1    1        .0050     .024           .0051
              2        .0014     .061
              3        .0039     .036
Friedman-2    1        .010      .075           .00066
              2        .0012     .079
              3        .00088    .13
Friedman-3    1        .0082     .35            .0034
              2        .0062     .23
              3        .0035     .31

TABLE II
Test error (mean squared error) of L2-regression boosting,
ICEA and hierarchical on data sets Friedman-1, -2 and -3.



[Figure: training and test error versus log(number of rounds or projections) on Friedman-2 data with 2000 training points, for L2 boosting and for ICEA with two 3-dim agents, three 2-dim agents and five 1-dim agents.]

Fig. 5. Simulation using Friedman-2 data with 2000 training data points
and 4000 test data points. The training error and test error for L2-regression
boosting and ICEA with three different distributed systems: five 1-dim agents,
three 2-dim agents and two 3-dim agents. The training error (dashed lines)
and test error (solid lines) of ICEA decrease monotonically and converge
quite fast. Systems with high-dimensional individual agents have lower
training error and test error than systems with low-dimensional individual
agents.



As expected, L2-regression boosting performs best for most
of the cases, except for Friedman-1, the hidden hypothesis
of which is basically additive. Because System-2 is not so
complicated yet is complex enough to fully capture the model,
the ICEA algorithm running on System-2 performs best on this data set.
However, for Friedman-2 and Friedman-3, where the hidden
models are no longer additive, L2-regression boosting, with
full access to all the dimensions, outperforms the other algorithms.
For ICEA, the performance is better when the individual
agents have access to more dimensions and are therefore capable of describing
more complex coupling among the features. The hierarchical
algorithm works (though not so well) for additive models, yet
it performs poorly for data sets with complicated
functions where there is strong coupling among the features.
Since the estimators used for individual agents in ICEA and
the hierarchical algorithm are identical (regression trees), the
performance difference can be attributed to the benefit of
applying cooperative training in ICEA.

VII. Problems of ICEA and Possible Improvements 

A. Limitations of ICEA 

Of course, since we restrict our approximation of $\phi(x_F)$
to the sum $\sum_{j=1}^{D} g_j(x_{F_j})$, we lose richness of the ensemble
estimator. For instance, for the function $\phi(x_1, x_2) = x_1 x_2$
with $X_1, X_2$ being independent standard Gaussian variables,
if agent 1 has access only to dimension 1, and agent 2 has
access only to dimension 2, then the linear model estimator is
simply 0, which means nothing can be learned. So restricting
to a linear form can lead to some serious problems. However,
there are several ways to solve this problem and greatly expand
the efficacy of ICEA.

First, we can linearize the function $\phi(x_F)$. For instance, in
the example above, if we take the logarithm of the function
$\phi(x_1, x_2) = x_1 x_2$, then we get $\log(\phi(x_1, x_2)) = \log(x_1) +
\log(x_2)$. In this case, we can use the linear additive model
to accurately depict the ensemble function. Therefore, with a
proper non-linear transformation, we can greatly expand the
scope of problems that can be optimally solved by our linear
additive model.

Second, we can project the function $\phi(x_F)$ on more linear
combinations of the features of the agents. For instance, in the
above problem, if we have two other agents that have access to
data $x_1 + x_2$ and $x_1 - x_2$, then the model can also be accurately
learned by these two agents, since $x_1 x_2 = \left((x_1 + x_2)^2 - (x_1 - x_2)^2\right)/4$.
In practice, because there is
significant redundancy amongst the data of the agents, we
don't need to intentionally calculate these linear combinations
(which requires more communication resources). Instead, we
simply take advantage of the redundancies contained in the
data, which are often considered to be a hazard in some learning
algorithms.
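The product example can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the paper's experiment: `additive_fit_mse` is a hypothetical helper that fits the best additive quadratic model by least squares, once on the raw coordinates and once on the rotated inputs $x_1 + x_2$ and $x_1 - x_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 20000))     # independent standard Gaussian features
y = x1 * x2                              # hidden rule that is not additive in x1, x2

def additive_fit_mse(y, inputs, degree=2):
    """Training MSE of the best additive polynomial fit sum_j g_j(input_j)."""
    cols = [np.ones_like(y)] + [u ** d for u in inputs for d in range(1, degree + 1)]
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ coef) ** 2)

baseline = np.mean(y ** 2)                                  # error of the zero estimator
mse_axes = additive_fit_mse(y, [x1, x2])                    # agents see x1 and x2
mse_rotated = additive_fit_mse(y, [x1 + x2, x1 - x2])       # agents see x1 + x2 and x1 - x2
```

With the raw coordinates the additive fit barely improves on the zero estimator, whereas with the rotated inputs the residual vanishes up to rounding, in line with the identity for $x_1 x_2$.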

B. Developing More Intelligent Algorithms 

Boosting for regression sheds light on the design of al- 
gorithms more "intelligent" than ICEA, which simply refits 
the residual on each agent one after another. ICEA can be 
improved in more ways than one, in terms of increasing speed 



of convergence, finding a natural stopping rule to avoid over- 
training and to reduce generalization error. Several rudimen- 
tary experiments have shown that, instead of iteratively refit- 
ting the residual one agent after another to reduce merely the 
training error, we are better off if we choose among the agents 
more intelligently, and take both the training error on the 
residual and the complexity of the model into consideration. 
For instance, a greedy algorithm that always chooses the agent 
that provides the minimum training error can greatly increase 
the speed of convergence, and an algorithm using the size 
of the decision tree as a penalty term can effectively reduce 
overtraining. It is worthwhile to explore more subtle rules of 
selecting estimators from agents and more delicate ways to 
combine them. ICEA is perhaps the most intuitive algorithm, 
but far from the optimal one. 

VIII. Conclusions 

By restricting the ensemble estimator to an additive form
(linear combination of individual estimators), we have developed
an iterative algorithm (ICEA) that is guaranteed, under
certain conditions, to converge to a unique limit function
(or rule, or hypothesis). This limit is an approximation of
the true function, and, with the help of some additional
features (linearization, redundant data), the approximation can
be quite accurate. ICEA also works quite well with real data,
with enough training points and properly selected individual
estimators. By sending only the predictions and withholding the
data, ICEA also preserves the privacy of data of individual
agents. There are still many aspects of the algorithm that
can be changed to improve the performance of distributed
regression, and these are of interest for further investigation.

IX. Appendix 
Lemma 1: Suppose $g$ is a polynomial of order $M$,

$$g(x) = \sum_{n=0}^{M} a_n x^n,$$

and $g'(x)$ is given by

$$g'(x) = \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} g(u) f_{X|Y}(u \mid y)\, du \right) f_{Y|X}(y \mid x)\, dy.$$

Then, with the additional assumption that

$$\int_{-\infty}^{\infty} g(x) f_X(x)\, dx = 0,$$

we have the inequality

$$\frac{\int_{-\infty}^{\infty} g'^2(x) f_X(x)\, dx}{\int_{-\infty}^{\infty} g^2(x) f_X(x)\, dx} \le \rho^4,$$

where $f_{X|Y}$, $f_{Y|X}$, and $f_X$ are all probability densities derived
from the joint Gaussian distribution of zero mean, unit
variance and correlation $\rho$.

Proof: The conditional distribution of $X$ given $Y$ and the
distribution of $Y$ given $X$ are

$$X \mid Y \sim N(\rho y, 1 - \rho^2),$$

$$Y \mid X \sim N(\rho x, 1 - \rho^2).$$

Therefore, we have

[...]

Notice that the exponential term, with proper manipulation,
can be expressed in the form of a sum of Hermite
polynomials. Thus, on defining

[...]

we have

[...]

Then the expression of $g'(x)$ can be rewritten as

[...]

Therefore, we have a closed-form expression for $g'(x)$, which
is also an $M$-th order polynomial:

[...]

By computation, we can derive

[...]

Moreover, since $g$ is of zero mean,

[...]

Therefore,

[...]

If the hidden rule $\phi(x_1, x_2)$ is restricted to a bivariate polynomial
with finite order $M$ and zero mean, and $g_1(x_1), g_2(x_2)$
are both initialized as 0, then after each iteration, $g_1(x_1)$ and
$g_2(x_2)$ will remain in the space of zero-mean polynomials of
finite order $M$. If we define the distance between two polynomials
$g_1(x)$ and $g_2(x)$ as $\int_{-\infty}^{\infty} \left(g_1(x) - g_2(x)\right)^2 f_X(x)\, dx$,
then, according to Lemma 1, the map $T$ that converts $g(x)$ to
$g'(x)$ is a contractive map. Because, under our definition of
the distance, the space of zero-mean polynomials is complete,
we can apply the contractive mapping theorem to guarantee
the functional convergence of ICEA for the two-agent,
two-dimensional, joint-Gaussian, finite-order-polynomial case.
Moreover, also by the lemma, the "contractive factor" of the
map is $\rho^4$. As for the hidden rule with non-zero mean, the bias
is also addressed by the constant term of the first agent, which
will remain the same in the following iterations and hence has
no influence on the convergence.